
Implement 4over6 NVFP4 recipe #2972

Open

zianglih wants to merge 57 commits into NVIDIA:main from zianglih:4over6

Conversation

@zianglih (Contributor) commented May 9, 2026

Description


Implement 4over6 nvfp4 from:

FlashInfer PR:

Enable per-block map-to-4 versus map-to-6 candidate selection for 1D/2D NVFP4 quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT and stochastic rounding to be disabled. Both the original per-tensor scaling and the row-scaled NVFP4 introduced by #2931 are supported.
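For intuition, here is a minimal NumPy sketch of the per-block selection rule (illustrative only: the function names are hypothetical, the real kernels operate on E4M3-encoded block scales under a global tensor scale, and true E2M1 rounding uses round-to-nearest-even rather than this simplified nearest-grid lookup):

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def _quantize_block(block, scale):
    """Round each element to the nearest FP4 magnitude at the given decode scale."""
    mag = np.abs(block) / scale
    nearest = FP4_GRID[np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)]
    return np.sign(block) * nearest * scale

def quantize_4over6_block(block, err_mode="mse"):
    """Pick the lower-error candidate per 1x16 block; ties keep map-to-6."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return block.copy()
    cand6 = _quantize_block(block, amax / 6.0)  # map-to-6: amax lands on FP4 value 6
    cand4 = _quantize_block(block, amax / 4.0)  # map-to-4: 1.5x expanded decode scale
    err = np.square if err_mode == "mse" else np.abs
    e6, e4 = err(cand6 - block).sum(), err(cand4 - block).sum()
    return cand6 if e6 <= e4 else cand4

# Example: one 16-element block.
out = quantize_4over6_block(np.random.randn(16).astype(np.float32))
```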

This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Adds scoped NVFP4 4over6 control through NVTE_NVFP4_4OVER6=weights|activations|all, with unset preserving existing behavior, and threads the selected scope through recipes, quantizers, tensor metadata, split quantization, single-tensor quantization, and the C++ tensor/config APIs (a resolution sketch follows this list).
  • Implements 1D & 2D NVFP4 4over6 quantization in the existing NVFP4 CUDA paths by comparing TE-style map-to-4 and map-to-6 FP4 candidates with the original 4over6 MSE rule, choosing map-to-6 on ties, honoring NVTE_USE_FAST_MATH, and rejecting unsupported combinations such as stochastic rounding, grouped tensors, and RHT.
  • Updates dequantization and NVFP4 GEMM scaling to respect per-tensor 4over6 metadata, using 256-based normalization for 4over6 tensors and 448-based normalization for regular NVFP4 tensors without requiring callers to do hidden rescaling.
  • Extends the Python reference implementation to mirror the intended ground truth, meaning TE-style candidate quantization plus original 4over6 MSE/compare logic, and uses this reference for bitwise exact tests where fast math is disabled.
  • Expands C++ and Python coverage across exact NVFP4 quantization, GEMM, dequantization, recipe scope resolution, quantized tensor handling, numerics, sanity, CUDA graph, torch compile, CPU offload, fusible ops, and backward override paths, while documenting the new environment variable and known unsupported modes.
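A hedged sketch of how the scope variable could resolve. Only the variable name NVTE_NVFP4_4OVER6 and its weights|activations|all values come from the PR; the helper name and return shape are assumptions for illustration:

```python
import os

def resolve_nvfp4_4over6_scope():
    """Hypothetical helper: map NVTE_NVFP4_4OVER6 to per-tensor-kind flags."""
    scope = os.getenv("NVTE_NVFP4_4OVER6")  # unset preserves existing behavior
    if scope is None:
        return {"weights": False, "activations": False}
    if scope not in ("weights", "activations", "all"):
        raise ValueError(f"NVTE_NVFP4_4OVER6 must be weights|activations|all, got {scope!r}")
    return {
        "weights": scope in ("weights", "all"),
        "activations": scope in ("activations", "all"),
    }
```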

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zianglih zianglih marked this pull request as draft May 9, 2026 03:50
@zianglih zianglih changed the title from "Implement 4over6 nvfp4" to "Implement 4over6 nvfp4 recipe" May 9, 2026
@zianglih zianglih changed the title from "Implement 4over6 nvfp4 recipe" to "Implement 4over6 NVFP4 recipe" May 9, 2026
@greptile-apps (Bot) commented May 9, 2026

Greptile Summary

This PR adds NVFP4 4over6 quantization support to TransformerEngine's NVFP4BlockScaling recipe. For each 1x16 block, it quantizes with both a map-to-4 candidate (1.5x expanded scale) and a map-to-6 candidate (normal scale), then selects the lower-error option (tie goes to map-to-6), mirroring the approach from the fouroversix paper/repo.

  • Adds a new CUDA kernel (quantize_4over6_nvfp4.cuh) with pipeline-staged shared-memory loading, per-block candidate comparison (MAE or MSE error), 2D/1D support, and row-scaled amax support; pairs it with updated dequantization and GEMM scale kernels that parameterize the E4M3 normalization divisor per tensor.
  • Threads the 4over6 mode, E4M3 max bound, and error-mode metadata through the full Python-to-C++ stack with explicit rejection guards for stochastic rounding, RHT, and grouped tensors; extends the Python reference quantizer and adds broad test coverage (a minimal guard sketch follows this list).
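A minimal sketch of those rejection guards (hypothetical shape; the actual checks live in the C++ layers listed in the files table below):

```python
def assert_4over6_supported(*, stochastic_rounding: bool, with_rht: bool, grouped: bool):
    """Hypothetical guard mirroring the rejections described above."""
    unsupported = {
        "stochastic rounding": stochastic_rounding,
        "RHT": with_rht,
        "grouped tensors": grouped,
    }
    enabled = [name for name, flag in unsupported.items() if flag]
    if enabled:
        raise ValueError(f"NVFP4 4over6 does not support: {', '.join(enabled)}")
```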

Confidence Score: 5/5

The PR is safe to merge. The 4over6 feature is entirely opt-in and isolated behind the new nvfp4_4over6 recipe field; all existing NVFP4 paths are untouched when the flag is unset.

The kernel implementation is well-guarded with explicit rejection of stochastic rounding, RHT, and grouped quantization at multiple call-site layers. The two findings are limited to an inconsistency in a property setter that all current internal call sites avoid, and a missing secondary validation inside quantize_4over6 that is already covered by quantize_fwd_helper for all real callers.

The new quantize_4over6_nvfp4.cuh kernel and the grouped_tensor_storage.py property setter are the two places worth a careful second read.

Important Files Changed

  • transformer_engine/common/cast/nvfp4/quantize_4over6_nvfp4.cuh: New 671-line CUDA kernel implementing the 4over6 candidate comparison, with pipeline-staged async shared-memory loads, correct 2D warp-level reductions, and a proper sm_100 guard. Missing output->nvfp4_4over6 == true validation when called directly (bypassing quantize_fwd_helper).
  • transformer_engine/common/cast/dispatch/quantize.cuh: Adds tensor/config consistency checks and dispatches to quantize_4over6 when nvfp4_4over6 is set; all three dispatch sites correctly guard and branch.
  • transformer_engine/common/cast/nvfp4/dequantize_nvfp4.cuh: Parameterizes E4M3_MAX as a compile-time template argument; ROW_SCALED_NVFP4 is also promoted to a template parameter; correct dispatch for both the 256 and 448 bounds.
  • transformer_engine/common/recipe/nvfp4.cu: Correctly plumbs per-tensor fp8_max_A/fp8_max_B into the GEMM scale kernel instead of the old hardcoded 448 constant.
  • transformer_engine/pytorch/csrc/quantizer.cpp: The NVFP4Quantizer constructor and quantize_impl correctly thread nvfp4_use_4over6, nvfp4_e4m3_max, nvfp4_4over6_err_mode, and err_use_fast_math through quant_config; RHT and stochastic-rounding guards are present.
  • transformer_engine/pytorch/csrc/extensions/cast.cpp: All split-quantize helpers and bulk-alloc paths propagate 4over6 metadata; grouped and RHT paths correctly reject 4over6 with clear error messages.
  • transformer_engine/pytorch/tensor/nvfp4_tensor.py: nvfp4_use_4over6 and nvfp4_e4m3_max are threaded through new, copy, reduce_ex, all-gather metadata, and the view/reshape autograd functions.
  • transformer_engine/pytorch/tensor/storage/grouped_tensor_storage.py: Adds nvfp4_use_4over6 and nvfp4_e4m3_max properties; the nvfp4_e4m3_max setter does not replicate the nvfp4_use_4over6 guard from _initialize_storage_fields.
  • transformer_engine/pytorch/custom_recipes/quantization_ref_nvfp4.py: The Python reference 4over6 quantizer correctly mirrors the CUDA candidate-compare logic; dimension handling for the row_scaled and 2D cases is consistent with the kernel.
  • transformer_engine/common/include/transformer_engine/transformer_engine.h: New kNVTENVFP44Over6 (9) and kNVTENVFP4E4M3Max (10) tensor params plus four new QuantizationConfigAttributes; getter/setter helpers correctly encode/decode values.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Python Recipe NVFP4BlockScaling nvfp4_4over6 scope] --> B[NVFP4BlockScalingRecipeState resolve nvfp4_use_4over6]
    B --> C[NVFP4Quantizer nvfp4_use_4over6 bool nvfp4_e4m3_max int]
    C --> D{quantize path}
    D -- single tensor --> E[quantize_impl set quant_config fields reject RHT + stochastic_rounding]
    D -- split quant --> F[split_quantize_nvfp4_impl reject RHT set per-config fields]
    D -- grouped --> G[group_quantize_nvfp4_impl reject 4over6]
    E --> H[quantize_fwd_helper check tensor/config consistency]
    F --> H
    H -- nvfp4_use_4over6=true --> I[quantize_4over6 E4M3_MAX switch ErrMode switch]
    H -- nvfp4_use_4over6=false --> J[existing quantize_transpose kernels]
    I --> K[quantize_4over6_kernel load tile async compute ScalePair map4+map6 pick lower error write selected scale+data]
    K --> L[NVFP4Tensor _nvfp4_use_4over6 _nvfp4_e4m3_max]
    L --> M[dequantize_fp4_kernel E4M3_MAX template]
    L --> N[nvte_nvfp4_compute_per_tensor_scale fp8_max from tensor]
```

Reviews (8): Last reviewed commit: "Remove 4over6 benchmark"

Comment thread transformer_engine/pytorch/csrc/extensions/cast.cpp Outdated
Comment thread transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu Outdated
Comment thread transformer_engine/common/recipe/__init__.py
Comment thread tests/pytorch/test_sanity.py Outdated
@zianglih (Contributor Author) commented May 11, 2026

Functionality has been verified by internal RL experiments.
We may want to allow separate 4over6 config for weights and activations, maybe NVTE_NVFP4_ENABLE_4OVER6=weights|activations|all.

@ptrendx ptrendx requested a review from negvet May 11, 2026 17:12
@ptrendx ptrendx added the community-contribution (PRs from external contributors outside the core maintainers, representing community-driven work) and fp4 labels May 11, 2026
@zianglih (Contributor Author) commented:

Need to rebase.

@zianglih zianglih marked this pull request as draft May 11, 2026 21:17
@zianglih zianglih marked this pull request as ready for review May 11, 2026 22:36
From transformer_engine/common/include/transformer_engine/transformer_engine.h:

```c++
 * its values are populated during quantization.
 */
kNVTERowScaledNVFP4 = 8,
kNVTENVFP44Over6 = 9, /*!< Whether an NVFP4 tensor uses 4over6 scaling */
```
@timmoon10 (Collaborator) commented May 11, 2026:

We are specifying this redundantly in NVTETensor and NVTEQuantizationConfig. If this option can be isolated to quantization, then we should not add clutter to the tensor. If the option is needed for downstream consumers (dequantization, GEMM), then it should be treated as part of the tensor data. I'm not especially familiar, but 4over6 seems like it should be specific to quantization.

@zianglih (Contributor Author) replied:

4over6 changes the decode convention from 1 / (6 * 448) to 1 / (6 * 256). Therefore, for our current representation 4over6 is part of the tensor data contract, not just a quantization option.
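Concretely, a hypothetical helper illustrating the convention stated above (the tensor scale follows Equation 1 of the paper quoted further down):

```python
def per_tensor_decode_scale(amax: float, e4m3_max: float) -> float:
    # e4m3_max is 448 for regular NVFP4 and 256 for 4over6 tensors, so a
    # consumer cannot decode correctly without knowing which bound was used.
    return amax / (6.0 * e4m3_max)
```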

```diff
 using namespace detail;
-constexpr float fp8_max = TypeExtrema<fp8e4m3>::max;  // 448.0f;
+constexpr float fp8_max = USE_4OVER6 ? 256.0f : TypeExtrema<fp8e4m3>::max;  // 448.0f;
 constexpr float fp4_max = TypeExtrema<fp4e2m1>::max;  // 6.0f;
```
A collaborator commented:

How much benefit does changing the FP8 scale have on convergence? If we don't see a clear benefit, then it would be nicer to use the same scale for 4over6 and non-4over6. That way we can keep this logic confined to quantization, and downstream consumers are completely unaffected.

If there is an impact on training quality, we should still consider disentangling the FP8 scaling from 4over6. I don't see why other NVFP4 recipes might not benefit from tweaking the scaling.

@zianglih (Contributor Author) replied:

From the original paper:

Finally, we make one modification to the computation of the tensor scale α (Equation 1) when quantizing to NVFP4 with 4/6. When M_FP4 × M_FP8 is used to compute the tensor scale, it ensures that all quantized values will be less than 6 × 448. However, this makes it impossible to select a scale of 4 for the blocks that contain a tensor's largest values, because the block's scale would need to be 448 × 6/4 = 672, which would overflow since 448 is the maximum value that can be represented by E4M3. As a result, when computing the tensor scale, we replace M_FP8 with 256 in Equation 1, since 256 is the largest E4M3 value that can be multiplied by 6/4 and represented without error in E4M3, as 384.

Also:

In Section 3.1, we propose calculating the FP32 global tensor scale using 256 as the maximum FP8
E4M3 value rather than the default of 448, as this allows blocks with a tensor’s largest value to have
the option to have a largest FP4 value of 4. In Figure 6, we find that this provides a marginal benefit
over using the standard tensor scale calculation. Even though this adjustment only affects a small
number of large values, this performance gain may come from the fact that larger activation values
can have an outsize impact on model performance. This adjustment is incorporated into the remaining
experiments in this section.
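The arithmetic in these passages is easy to sanity-check with PyTorch's float8_e4m3fn dtype (an illustrative check, not part of the PR; out-of-range values fail the round trip regardless of whether the cast saturates or produces NaN):

```python
import torch

def exact_in_e4m3(x: float) -> bool:
    """True if x survives a round trip through FP8 E4M3 unchanged."""
    return torch.tensor(x).to(torch.float8_e4m3fn).float().item() == x

assert not exact_in_e4m3(448.0 * 6 / 4)  # 672 overflows E4M3, whose max is 448
assert exact_in_e4m3(256.0 * 6 / 4)      # 384 = 256 * 1.5 is exact, so 256 works
```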

A collaborator replied:

Not sure if there are internal or external studies about the convergence, but this is required to make it work. We need the largest value below 448/1.5 such that both the value itself and its product with 1.5 are represented exactly in E4M3. This helps avoid quantization noise on both the map-to-4 and map-to-6 paths.

A collaborator replied:

We did find that using 256 to calculate the second-level scaling factor helped convergence versus 448, but only slightly.

It's possible that the premise of the paper's argument (preventing saturation when the 4 scaling effectively multiplies the block decode scale by 1.5) is sound, but that a value larger than 256 can achieve this, and that perfectly representing the block containing the global amax under both scalings is not worth the extra range loss.

@zianglih (Contributor Author) replied:

Let me make 256 scaling a separate env var, disabled by default.

@zianglih (Contributor Author) replied:

448, 320, 288, 256 are all potential candidates for map-to-6:

  • 448: effectively disable map-to-4 option above 256, preserve range
  • 320, 288: map-to-4 uses 448, no precise 1.5x
  • 256: map-to-4 uses 384, precise 1.5x

For now, let me refactor the interface to NVTE_NVFP4_4OVER6_E4M3="448"|"256", defaulting to "448", and dispatch to a numeric template parameter in the C++ code instead of a boolean toggle. People can add support for other values or make it more generic (such as directly parsing the env var digits) in the future.
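Running the same round-trip check from the earlier sketch over the four candidates confirms the list above (illustrative):

```python
import torch

def exact_in_e4m3(x: float) -> bool:
    return torch.tensor(x).to(torch.float8_e4m3fn).float().item() == x

for m6 in (448.0, 320.0, 288.0, 256.0):
    # 448 * 1.5 = 672 overflows; 320 * 1.5 = 480 and 288 * 1.5 = 432 have no
    # exact E4M3 encoding; only 256 * 1.5 = 384 round-trips exactly.
    print(int(m6), "->", m6 * 1.5, "exact:", exact_in_e4m3(m6 * 1.5))
```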

@zianglih (Contributor Author) replied:

NVTE_NVFP4_4OVER6_E4M3_USE_256=weights|activations|all is a cleaner pattern and allows separate configuration.

Comment thread tests/pytorch/utils.py Outdated
Comment thread transformer_engine/common/cast/dispatch/quantize.cuh Outdated
A collaborator commented:

This test is okay, but it would provide much more confidence if the NVFP4 quantization tests compared against a CPU reference impl.

@zianglih (Contributor Author) replied:

Extended tests/cpp/operator/test_cast_nvfp4_transpose.cu coverage in 3bb42b1.

@zianglih zianglih marked this pull request as draft May 12, 2026 02:01
@zianglih zianglih marked this pull request as ready for review May 12, 2026 06:45
@zianglih zianglih requested a review from timmoon10 May 12, 2026 06:47
@zianglih zianglih marked this pull request as draft May 12, 2026 09:03
@zianglih zianglih marked this pull request as ready for review May 12, 2026 10:10
```python
nvfp4_4over6 : {None, 'weights', 'activations', 'all'}, default = None
    Select tensors that use NVFP4 4over6. In this mode NVFP4
    quantization evaluates per-block map-to-4 and map-to-6 candidates
    and chooses the one with lower MSE. Ties choose map-to-6. The
```
A collaborator commented:

We need both MSE (better for post-training?) and MAE (better for pre-training as per our internal studies) to be supported, with MAE as the default.


@Oleg-Goncharov Oleg-Goncharov self-requested a review May 12, 2026 16:37
zianglih added 20 commits May 13, 2026 00:36
@negvet (Collaborator) commented May 13, 2026

What is the e2e step time increase with 4/6 on some typical workload?

zianglih added 2 commits May 13, 2026 02:36
@zianglih (Contributor Author) commented May 13, 2026

Major changes from last time:

  • Use a standalone quantization kernel implementation instead of folding into existing code. 4over6 quantize is heavily fp32 compute bound (#2972 (comment) and #2972 (comment)), and the latency-hiding techniques in TE's original NVFP4 quant kernels lead to higher register pressure and worse performance. There is not much we can do about the fp32 arithmetic bottleneck without changing heuristics. Even if we want to further optimize perf/heuristics, I think we should do it in a separate PR and extend it as new error modes. cc @Oleg-Goncharov @kwyss-nvidia
  • Allow both 448 and 256 configurations. The user can configure this by setting NVTE_NVFP4_4OVER6_E4M3_USE_256. However, the underlying implementation encodes nvfp4_e4m3_max and an E4M3_MAX template parameter instead of a boolean flag, so other values can easily be supported in the future. cc @timmoon10 @kwyss-nvidia @negvet
  • Add and default to the MAE error mode. cc @negvet
  • For the 4over6 quantize C++ test, we no longer check map-to-4 vs map-to-6 selection and accept either as bitwise exact. This avoids numerics drift across CPU architectures. The Python test still has strict candidate-selection coverage. cc @Oleg-Goncharov

@zianglih zianglih marked this pull request as ready for review May 13, 2026 09:48
@zianglih zianglih requested review from ksivaman and ptrendx as code owners May 13, 2026 09:48